Abstract

Recent studies have demonstrated the power of recurrent neural networks for machine 
translation, image captioning and speech recognition. For the task of capturing temporal 
structure in video, however, there still remain numerous open research questions. Current 
research suggests using a simple temporal feature pooling strategy to take into account the 
temporal aspect of video. We demonstrate that this method is not sufficient for gesture 
recognition, where temporal information is more discriminative compared to general video 
classification tasks. We explore deep architectures for gesture recognition in video and 
propose a new end-to-end trainable neural network architecture incorporating temporal 
convolutions and bidirectional recurrence. Our main contributions are twofold: first,
we show that recurrence is crucial for this task; second, we show that adding temporal 
convolutions leads to significant improvements. We evaluate the different approaches on 
the Montalbano gesture recognition dataset, where we achieve state-of-the-art results. 


1 Introduction 

Gesture recognition is one of the core components in the thriving research field of human-computer
interaction. The recognition of distinct hand and arm motions is becoming increasingly
important, as it enables smart interactions with electronic devices. Furthermore,
gesture identification in video can be seen as a first step towards sign language recognition, 
where even subtle differences in motion can play an important role. Some examples that 
complicate the identification of gestures are changes in background and lighting due to the 
varying environment, variations in the performance and speed of the gestures, different clothes 
worn by the performers and different positioning relative to the camera. Moreover, regular 
hand motion or out-of-vocabulary gestures should not be confused with one of the target
gestures. 

Convolutional neural networks (CNNs) (LeCun et al., 1998) are the de facto standard approach
in computer vision. CNNs have the ability to learn complex hierarchies with increasing levels
of abstraction while being end-to-end trainable. Their success has had a huge impact on
vision-based applications like image classification (Krizhevsky et al., 2012), object detection
(Sermanet et al., 2013), human pose estimation (Toshev & Szegedy, 2014) and many more.
A video can be seen as an ordered collection of images. Classifying a video frame by frame
with a CNN is bound to ignore motion characteristics, as there is no integration of temporal
information. Depending on the task at hand, aggregating the spatial features produced by the
CNN with temporal pooling can be a viable strategy (Karpathy et al., 2014; Ng et al., 2015).
As we show in this paper, however, this method is of limited use for gesture recognition.

Apart from a collection of frames, a video can also be seen as a time series. Some of the most 
successful models for time series classification are recurrent neural networks (RNNs) with 
either standard cells or long short-term memory (LSTM) cells (Hochreiter & Schmidhuber,
1997). Their ability to learn dynamic temporal dependencies has allowed researchers to achieve
breakthrough results in e.g. speech recognition (Graves et al., 2013), machine translation
(Sutskever et al., 2014) and image captioning (Vinyals et al., 2015). Before feeding video to
recurrent models, we need to incorporate some form of spatial or spatiotemporal feature
extraction. This motivates the concept of combining CNNs with RNNs. CNNs have unparalleled
spatial (and spatiotemporal with added temporal convolutions) feature extraction capabilities, 
while adding recurrence ensures the modeling of feature evolution over time. 

For general video classification datasets like UCF-101 (Soomro et al., 2012), Sports-1M
(Karpathy et al., 2014) or HMDB-51 (Kuehne et al., 2011), the temporal aspect is of less
importance compared to a gesture recognition dataset. For example, the appearance of a 
violin almost certainly suggests the target class is “playing violin”, as no other class involves 
a violin. The model has no need to capture motion information for this particular example. 
That being said, there are some categories where modeling motion in some way or another is 
always beneficial. In the case of gesture recognition, however, motion plays a more critical 
role. Many gestures are not only defined by their spatial hand and/or arm placement, but 
also by their motion pattern. 

In this work, we explore a variety of end-to-end trainable deep networks for video classification 
applied to frame-wise gesture recognition with the Montalbano dataset that was introduced in 
the ChaLearn LAP 2014 Challenge (Escalera et al., 2014). We study two ways of capturing the
temporal structure of these videos. The first method involves temporal convolutions to enable 
the learning of motion features. The second method introduces recurrence to our networks,
allowing them to model the temporal dynamics that play an essential role in gesture
recognition.


2 Related Work 

An extensive evaluation of CNNs on general video classification is provided by Karpathy 
et al. (2014) using the Sports-1M dataset. They compare different frame fusion methods to a
baseline single-frame architecture and conclude that their best fusion strategy only modestly
improves the accuracy of the baseline. Their work is extended by Ng et al. (2015), who show
that LSTMs achieve no improvements over a temporal feature pooling scheme on the UCF-101
dataset for human action classification and only marginal improvements on the Sports-1M
dataset. For this reason, the single-frame and the temporal pooling architectures are important 
baseline models. 

Another way to capture motion is to convert a video stream to a dense optical flow. This is a 
way to represent motion spatially by estimating displacement vectors of each pixel. It is a core
component in the two-stream architecture described by Simonyan & Zisserman (2014) and is
used for human pose estimation (Jain et al., 2014), for global video descriptor learning (Ng
et al., 2015) and for video captioning (Venugopalan et al., 2015). We have not experimented
with optical flow, because (i) it has a greater computational preprocessing complexity and (ii) 
our models should implicitly learn to infer motion features in an end-to-end fashion, so we 
chose not to engineer them. 

Neverova et al. (2014) present an extended overview of their winning solution for the ChaLearn
LAP 2014 gesture recognition challenge and achieve a state-of-the-art score on the Montalbano 
dataset. They propose a multi-modal ‘ModDrop’ network operating at three temporal scales 
and use an ensemble method to merge the features at different scales. They also developed a 
new training strategy, ModDrop, that makes the network’s predictions robust to missing or 
corrupted channels. 

Most of the constituent parts in our architectures have been used before in other work for 
different purposes. Learning motion features with three-dimensional convolution layers has 
been studied by Ji et al. (2013) and Taylor et al. (2010) to classify short clips of human actions
on the KTH dataset. Baccouche et al. (2011) proposed a two-step scheme to model
the temporal evolution of learned features with an LSTM. Finally, the combination of a CNN
with an RNN has been used for speech recognition (Hannun et al., 2014), image captioning
(Vinyals et al., 2015) and video narration (Donahue et al., 2015).


3 Architectures 

In this section, we briefly describe the different architectures we investigate for gesture 
recognition in video. An overview of the models is depicted in Figure 1. Note that we pay 
close attention to the comparability of the network structures. The number of units in the 
fully connected layers and the number of cells in the recurrent models are optimized based on 
validation results for each network individually. All other hyper-parameters mentioned in this 
section and in Section 4.2 are optimized for the temporal pooling architecture. As a result, 
improvements over our baseline models are caused by architectural differences rather than 
better optimization, other hyper-parameters or preprocessing. 

Figure 1: Overview (a) Single-frame CNN architecture, (b) Temporal feature pooling
network (max- or mean-pooling), spanning multiple video frames, (c) Model with bidirectional
recurrence, (d) Adding temporal convolutions and three-dimensional max-pooling (MP refers to
max-pooling), (e) Architecture with added temporal convolutions and bidirectional recurrence.

3.1 Baseline Models

Single-Frame The single-frame architecture (Figure 1a) worked well for general video
classification (Karpathy et al., 2014), but is not a very fitting solution for our frame-wise
gesture recognition setting. Nevertheless, it gives us an indication of how much static
images contribute to the recognition. It has 3x3 convolution kernels in every layer. Two
convolutional layers are stacked before performing max-pooling on non-overlapping 2x2 spatial
regions. The shorthand notation of the full architecture is as follows: C(16) - C(16) - P -
C(32) - C(32) - P - C(64) - C(64) - P - C(128) - C(128) - P - D(2048) - D(2048) - S, where
$C(n_c)$ denotes a convolutional layer with $n_c$ feature maps, P a max-pooling layer, $D(n_d)$ a
fully connected layer with $n_d$ units and S a softmax classifier. We deploy leaky rectified linear
units (leaky ReLUs) in every layer. Their activation function is defined as
$\sigma : x \mapsto \max(\alpha x, x)$, where $\alpha = 0.3$. Leaky ReLUs seemed to work better than
conventional ReLUs and showed promising results in other work (Maas et al., 2013; Graham,
2014; Dieleman et al., 2015; Xu et al., 2015).
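As a rough illustration of the single-frame model described above, the following sketch implements the leaky ReLU and traces the spatial size of the feature maps through the four convolution/pooling blocks. It assumes a 64 by 64 input (the crop size mentioned in Section 4.1) and zero-padded convolutions that preserve the spatial size; the padding mode is not stated in the text, and the function names are hypothetical.

```python
def leaky_relu(x, alpha=0.3):
    """Leaky ReLU as defined in the text: max(alpha * x, x)."""
    return max(alpha * x, x)


# Feature maps per block in the shorthand C(16)-C(16)-P-...-C(128)-C(128)-P.
CONV_BLOCKS = [16, 32, 64, 128]


def flattened_feature_size(input_size=64):
    """Number of values entering the first fully connected layer.

    Assumption: the 3x3 convolutions are zero-padded, so only the
    non-overlapping 2x2 max-pooling changes the spatial size.
    """
    size = input_size
    for _ in CONV_BLOCKS:
        size //= 2  # 2x2 max-pooling halves each spatial dimension
    return size * size * CONV_BLOCKS[-1]


print(leaky_relu(-1.0), leaky_relu(2.0))
print(flattened_feature_size())
```

Under these assumptions, four pooling steps reduce a 64 by 64 frame to 4 by 4, so each frame is summarized by 4 x 4 x 128 values before the dense layers.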

Temporal Feature Pooling The second baseline model exploits a temporal feature pooling 
strategy. As suggested by Ng et al. (2015), we position the temporal pooling layer right
before the first fully connected layer as illustrated in Figure 1b. This layer performs either
mean-pooling or max-pooling across all video frames. The structure of the CNN-component is 
identical to the single-frame model. This network is able to collect all the spatial features in a 
given time window. However, the order of the temporal events is lost due to the nature of 
pooling across frames. 
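The loss of temporal order can be made concrete with a small sketch. The toy feature vectors below are invented for illustration, and `temporal_pool` is a hypothetical helper rather than the implementation used in the experiments.

```python
def temporal_pool(frame_features, mode="mean"):
    """Pool per-frame feature vectors across time, element-wise."""
    columns = list(zip(*frame_features))  # one tuple per feature dimension
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError("mode must be 'mean' or 'max'")


# Toy 2-dimensional features for a 3-frame clip, and the same clip reversed.
clip = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
reversed_clip = clip[::-1]

# Both pooling modes give identical results for the two clips, so a gesture
# and its time-reversed counterpart become indistinguishable after pooling.
print(temporal_pool(clip, "mean"), temporal_pool(reversed_clip, "mean"))
print(temporal_pool(clip, "max"), temporal_pool(reversed_clip, "max"))
```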


3.2 Bidirectional Recurrent Models 

The core idea of RNNs is to create internal memory to learn the temporal dynamics in 
sequential data. An issue (in our case) with conventional recurrent networks is that their states 
are built up from previous time steps. A gesture, however, generally becomes recognizable 
only after a few time steps, while the frame-wise nature of the problem requires predictions 
from the very first frame. This is why we use bidirectional recurrence, which enables us to
process sequences in both temporal directions. 

Describing the proposed model (Figure 1c) formally, we start with the CNN (identical to the
single-frame model) transforming an input frame $x_t$ to a more compact vector representation
$v_t$:

$$v_t = \mathrm{CNN}(x_t). \tag{1}$$

A bidirectional RNN computes two hidden sequences: the forward hidden sequence $h^{(F)}$ and
the backward hidden sequence $h^{(B)}$:

$$h_t^{(F)} = \mathcal{H}_F\left(v_t, h_{t-1}^{(F)}\right) \quad \text{and} \tag{2}$$

$$h_t^{(B)} = \mathcal{H}_B\left(v_t, h_{t+1}^{(B)}\right), \tag{3}$$

where $\mathcal{H}$ represents a recurrent layer and depends on the type of memory cell. There are
two different cell types in widespread use: standard cells and LSTM cells (Hochreiter &
Schmidhuber, 1997) (we use the modern LSTM cell structure with peephole connections (Gers
et al., 2003)). Both cell types will be compared in this work. Finally, the output predictions
$y_t$ are computed with a softmax classifier which takes the sum of the forward and backward
hidden states as input:

$$y_t = \mathrm{softmax}\left(W_y \left(h_t^{(F)} + h_t^{(B)}\right) + b_y\right). \tag{4}$$
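The forward pass, backward pass and summed-state classifier described above can be sketched in plain Python as follows. This is a minimal illustration, not the trained model: it uses a standard cell with a tanh nonlinearity (an assumption, the cell nonlinearity is not specified here), zero initial states, tiny dimensions and random weights, and all helper names are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rnn_step(v, h_prev, W_in, W_rec, b):
    """Standard cell (assumed tanh): h = tanh(W_in v + W_rec h_prev + b)."""
    a, r = matvec(W_in, v), matvec(W_rec, h_prev)
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]


def bidirectional_rnn(vs, fwd, bwd, W_y, b_y, n_hidden):
    T = len(vs)
    h_f, h_b = [None] * T, [None] * T
    h = [0.0] * n_hidden
    for t in range(T):  # forward hidden sequence
        h = rnn_step(vs[t], h, *fwd)
        h_f[t] = h
    h = [0.0] * n_hidden
    for t in reversed(range(T)):  # backward hidden sequence
        h = rnn_step(vs[t], h, *bwd)
        h_b[t] = h
    outputs = []
    for t in range(T):  # softmax over the summed hidden states
        summed = [f + b for f, b in zip(h_f[t], h_b[t])]
        logits = [z + c for z, c in zip(matvec(W_y, summed), b_y)]
        outputs.append(softmax(logits))
    return outputs


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_in, n_hidden, n_classes, T = 4, 3, 5, 6
fwd = (random_matrix(n_hidden, n_in, rng), random_matrix(n_hidden, n_hidden, rng), [0.0] * n_hidden)
bwd = (random_matrix(n_hidden, n_in, rng), random_matrix(n_hidden, n_hidden, rng), [0.0] * n_hidden)
W_y, b_y = random_matrix(n_classes, n_hidden, rng), [0.0] * n_classes
vs = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(T)]  # stand-in for CNN outputs
probs = bidirectional_rnn(vs, fwd, bwd, W_y, b_y, n_hidden)
```

Because the backward state at the first frame has already processed every later frame, the prediction for frame 1 can draw on the whole fragment, which is what the frame-wise setting requires.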


3.3 Adding Temporal Convolutions 

Our final set of architectures extends the CNN layers with temporal convolutions (convolutions 
over time). This enables the extraction of hierarchies of motion features and thus the capturing 
of temporal information from the first layer, instead of depending on higher layers to form 
spatiotemporal features. Performing three-dimensional convolutions is one approach to achieve 
this. However, this leads to a significant increase in the number of parameters in every layer, 
making this method more prone to overfitting. Therefore, we decide to factorize this operation 
into two-dimensional spatial convolutions and one-dimensional temporal convolutions. This 
leads to fewer parameters and optionally more nonlinearity if one decides to activate both 
operations. We opt to not include a bias or another nonlinearity in the spatial convolution 
step to maintain the comparability between architectures. 
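To give a sense of the parameter savings, the sketch below counts trainable parameters for a full three-dimensional convolution and for the factorized variant. The temporal kernel size of 3 and the channel counts are illustrative assumptions, since the text does not state them; following the text, only the temporal step carries a bias.

```python
def params_3d(n_in, n_out, k_spatial=3, k_temporal=3):
    """Weights of a full 3D convolution plus one bias per output map."""
    return n_out * n_in * k_spatial * k_spatial * k_temporal + n_out


def params_factorized(n_in, n_mid, n_out, k_spatial=3, k_temporal=3):
    """2D spatial convolution (no bias) followed by a 1D temporal convolution."""
    spatial = n_mid * n_in * k_spatial * k_spatial
    temporal = n_out * n_mid * k_temporal + n_out
    return spatial + temporal


# Example: a layer mapping 64 feature maps to 64 feature maps,
# with 64 intermediate spatial feature maps.
print(params_3d(64, 64))              # 110656
print(params_factorized(64, 64, 64))  # 49216
```

With equal channel counts c, the full kernel needs roughly 27c² weights whereas the factorized version needs (9 + 3)c², which is less than half.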

First, we compute spatial feature maps $s_t$ for every frame $x_t$. A pixel at position $(i,j)$ of the
$k$-th feature map is determined as follows:

$$s_{tij}^{(k)} = \sum_{n=1}^{N} \left(W_{\mathrm{spat}}^{(kn)} * x_t^{(n)}\right)_{ij}, \tag{5}$$

where $N$ is the number of input channels and $W_{\mathrm{spat}}$ are trainable parameters. Finally, we
convolve across the time dimension for every position $(i,j)$, add the bias $b^{(k)}$ and apply the
activation function $\sigma$:

$$z_{tij}^{(k)} = \sigma\left(b^{(k)} + \sum_{m=1}^{M} \left(W_{\mathrm{temp}}^{(km)} * s_{ij}^{(m)}\right)_t\right), \tag{6}$$

where the variables $W_{\mathrm{temp}}$ and $b$ are trainable parameters and $M$ is the number of spatial
feature maps.

Two different architectures are proposed using this new layer. In the first model (Figure
1d), we replace the convolutional layers of the single-frame CNN with the spatiotemporal
layer defined above. Furthermore, we apply three-dimensional max-pooling to reduce spatial
as well as temporal dimensions while introducing slight translational invariance in time.
Note that this architecture implies a sliding window approach for frame-wise classification,
which is computationally intensive. In the second model, illustrated in Figure 1e, the time
dimensionality is retained throughout the network. That means we only carry out spatial
max-pooling. This allows us to stack a bidirectional RNN with LSTM cells on top, responding
to high-level temporal dependencies. It also incidentally resolves the need for a sliding window
approach to implement frame-wise video classification.


4 Experiments 

4.1 Montalbano Gesture Recognition Dataset 

The ChaLearn Looking At People (LAP) 2014 Challenge (Escalera et al., 2014) consists of three
tracks: human pose recovery, human action/interaction recognition and gesture recognition. 
The dataset accompanying the gesture recognition challenge, called the Montalbano dataset, 
will be used throughout this work. The dataset is multi-modal, because the gestures are 
captured with a Microsoft Kinect that has a depth sensor. In all sequences, a single user is 
recorded in front of the camera, performing natural communicative Italian gestures. Each 
data file contains an RGB-D (where “D” stands for depth) image sequence and a skeletal pose 
stream provided by the Microsoft Kinect API. The gesture vocabulary contains 20 Italian 
cultural/anthropological signs. The gestures are not segmented, which means that sequences 
typically contain several gestures. Gesture performances appear randomly within the sequence 
without a prearranged rest pose. Moreover, several unannotated out-of-vocabulary gestures 
are present. 

It is the largest publicly available gesture dataset of its kind. There are 1,720,800 labeled
frames across 13,858 video fragments of about 1 to 2 minutes sampled at 20 Hz with a resolution
of 640x480. The gestures are performed by 27 different individuals under diverse conditions;
these include varying clothes, positions, backgrounds and lighting. The training set contains
11,116 gestures and the test set contains 2,742. The class imbalance is negligible. The starting
and ending frames for each gesture are annotated as well as the gesture class label. 

To speed up the training, we crop part of the images containing the user and rescale them to 
64 by 64 pixels using the skeleton information (other than that, we do not use any pose data). 
However, we show in Section 4.3 that we even achieve good results when we do not crop the 
images and leave out depth information. 
4.2 End-To-End Training


We train our models from scratch in an end-to-end fashion, backpropagating through time
(BPTT) for our recurrent architectures. The network parameters are optimized by minimizing
the cross-entropy loss function using mini-batch gradient descent with the Adam update
rule (Kingma & Ba, 2015). We found that Adam works well in practice, especially when
experimenting with very different layer types in the same model. All our models are trained
the same way with early stopping, a mini-batch size of 32, a learning rate of $10^{-3}$ and an
exponential learning rate decay. Before training, we initialize the weights with a random 
orthogonal initialization method (Saxe et al., 2013).

Recurrent Networks As described in Section 4.1, the video files in the Montalbano dataset 
contain approximately 1 to 2 minutes of footage, consisting of multiple gestures. Recurrent 
models are trained on random fragments of 64 frames and produce 64 predictions, one for 
every frame. To summarize, a data sample has 4 channels (RGB-D), 64 frames each, with 
a resolution of 64 by 64 pixels; or in shorthand notation: 4@64x64x64. We optimized the
number of cells for each model based on validation results. For LSTM cells, we only saw a 
small improvement between 512 and 1024 units, so we settled at 512. For RNNs with standard 
cells, we used 2048 units. The location of gestures within the long sequences is not given. A 
gesture is generally about 20 to 50 frames long. If a small fraction of a gesture is located 
at the beginning or the end of the 64 considered frames, the model does not have enough 
information to label these frames correctly. That is why we allow a buildup in both forward 
and backward direction for evaluation; we feed 64 frames into the RNN and keep the middle 
32 for evaluation. 
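The evaluation scheme for the recurrent models can be sketched as follows. Keeping the middle 32 of 64 predictions follows the text; tiling a long sequence with a stride equal to the kept length is an assumption made for this illustration, as is leaving the first and last 16 frames of the sequence uncovered. The function names are hypothetical.

```python
def middle_predictions(window_predictions, keep=32):
    """Keep the central `keep` predictions so that both the forward and the
    backward recurrence have some frames of buildup before the kept span."""
    margin = (len(window_predictions) - keep) // 2
    return window_predictions[margin:margin + keep]


def window_starts(n_frames, window=64, keep=32):
    """Start indices of windows whose kept middle parts tile the sequence.

    Assumption: consecutive windows are shifted by `keep` frames.
    """
    return list(range(0, n_frames - window + 1, keep))


frame_ids = list(range(64))
print(middle_predictions(frame_ids)[0], middle_predictions(frame_ids)[-1])  # 16 47
print(window_starts(128))  # [0, 32, 64]
```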

Non-Recurrent Networks The single-frame CNN is trained frame by frame and all other 
non-recurrent networks are trained with the number of frames optimized for their specific 
architecture. The best number of frames to mean-pool across is 32, determined by validation 
scores with tested values in [8,16,32,64]. In the case of max-pooling, we find that pooling 
over 16 frames gives better outcomes. Also, pretraining the CNNs frame-by-frame and 
fine-tuning with temporal max-pooling gave slightly improved results. We observed no 
improvements, however, using this technique with temporal mean-pooling. The architecture 
with added temporal convolutions and three-dimensional max-pooling showed optimal results 
by considering 32 surrounding frames. The targets for all the non-recurrent networks are the 
labels associated with the centermost frame of the input video fragment. We evaluate these 
models using a sliding window with single-frame steps. 
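For the non-recurrent networks, the sliding-window evaluation with a centermost-frame target could look like the sketch below. For an even window length the "centermost" frame is ambiguous; taking index `start + window // 2` is an assumption, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_frames, window=32):
    """Return (start, end, center) for every fully contained window,
    moving one frame at a time. The label of frame `center` is the target."""
    half = window // 2
    return [(s, s + window, s + half) for s in range(n_frames - window + 1)]


# A 34-frame fragment yields three windows of 32 frames.
print(sliding_windows(34, 32))  # [(0, 32, 16), (1, 33, 17), (2, 34, 18)]
```

Each window requires a full forward pass to label a single frame, which is why this approach is more expensive than the recurrent models that label every frame of a fragment at once.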

Regularization and Data-Augmentation We employed many different methods to regularize
the deep networks. Data augmentation has a significant impact on generalization. For
all our trained models, we used the same augmentation parameters: [-5, 5] pixel translations
in vertical direction and [-10, 10] horizontal, [-2, 2] rotation degrees, [-2, 2] shearing degrees,
[1/1.1, 1.1] image scaling factors and [1/1.2, 1.2] temporal scaling factors. From each of these
intervals, we sample a random value for each video fragment and apply the transformations online
using the CPU. Dropout with p = 0.5 is used on the inputs of every fully connected layer.
Furthermore, using leaky ReLUs instead of conventional ReLUs and factorizing three-dimensional
convolutions into spatial and temporal convolutions also reduce overfitting.
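Sampling one set of augmentation parameters per video fragment might be sketched as follows. Two things are assumptions here: the values are drawn uniformly (the sampling distribution is not stated), and the lower bounds of the two scaling intervals are taken to be the reciprocals 1/1.1 and 1/1.2 of the upper bounds. The dictionary keys are hypothetical names.

```python
import random

# (low, high) interval for each augmentation parameter.
AUGMENTATION_RANGES = {
    "translate_y_px": (-5, 5),
    "translate_x_px": (-10, 10),
    "rotation_deg": (-2, 2),
    "shear_deg": (-2, 2),
    "image_scale": (1 / 1.1, 1.1),      # assumed reciprocal lower bound
    "temporal_scale": (1 / 1.2, 1.2),   # assumed reciprocal lower bound
}


def sample_augmentation(rng):
    """Draw one value per parameter; the same values would then be applied
    to every frame of a fragment so that the motion stays consistent."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in AUGMENTATION_RANGES.items()}


params = sample_augmentation(random.Random(0))
for name, value in params.items():
    print(f"{name}: {value:.3f}")
```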
Architecture                      Jaccard Index   Precision   Recall    Error Rate*
Single-Frame CNN (Figure 1a)      0.465           67.86%      57.57%    20.68%
Temp Max-Pooling (Figure 1b)      0.748           85.03%      82.92%    8.66%
Temp Mean-Pooling (Figure 1b)     0.775           85.93%      85.80%    8.55%
Temp Conv (Figure 1d)             0.842           89.36%      90.15%    4.67%
RNN, Standard Cells (Figure 1c)   0.885           92.77%      93.56%    3.58%
RNN, LSTM Cells (Figure 1c)       0.888           93.75%      93.28%    3.55%
Temp Conv + LSTM (Figure 1e)      0.906           94.49%      94.57%    2.77%


Table 1: A comparison of the results for our different architectures on the Montalbano 
gesture recognition dataset. The Jaccard index indicates the mean overlap between the binary 
predictions and the binary ground truth across gesture categories. We also compute precision 
and recall scores for each gesture class and report the mean score across classes. 

*The error rate is based on majority voted frame-wise predictions from isolated gesture 
fragments. 
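The majority-vote error rate can be illustrated with a short sketch. The fragments and labels below are invented for illustration; the tie-breaking rule (the label seen first wins) is simply what `Counter.most_common` does and is not taken from the text.

```python
from collections import Counter


def majority_vote(frame_predictions):
    """Most frequent frame-wise label within one isolated gesture fragment."""
    return Counter(frame_predictions).most_common(1)[0][0]


def error_rate(fragments, labels):
    """Fraction of fragments whose majority-voted label is wrong."""
    wrong = sum(majority_vote(f) != y for f, y in zip(fragments, labels))
    return wrong / len(labels)


fragments = [[3, 3, 7], [5, 5, 5], [1, 2, 2]]  # frame-wise predictions
labels = [3, 5, 1]                             # true gesture classes
print(error_rate(fragments, labels))           # one of three fragments is wrong
```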


4.3 Results 


We follow the ChaLearn LAP 2014 Challenge score to measure the performance of our 
architectures. This way, we can compare with previous work on the Montalbano dataset. The 
competition score is based on the Jaccard index, which is defined as follows: 

$$J_{s,n} = \frac{|A_{s,n} \cap B_{s,n}|}{|A_{s,n} \cup B_{s,n}|}.$$

The binary ground truth for gesture category $n$ in sequence $s$ is denoted as the binary vector
$A_{s,n}$, whereas $B_{s,n}$ denotes the binary predictions. The Jaccard index $J_{s,n}$ can be seen as
the overlap rate between $A_{s,n}$ and $B_{s,n}$. To compute the final score, the mean Jaccard index
among all categories and sequences is computed:

$$\bar{J} = \frac{1}{NS} \sum_{s=1}^{S} \sum_{n=1}^{N} J_{s,n},$$

where $N = 20$ is the number of categories and $S$ the number of sequences in the test set.
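A frame-wise computation of this score for a single sequence might look like the following sketch. The label sequences are invented, label 0 stands for "no gesture", and skipping categories that occur in neither the ground truth nor the predictions (where the union is empty and the ratio undefined) is an assumption made here, not a detail given in the text.

```python
def binary_mask(labels, category):
    """Binary vector marking the frames that carry the given category."""
    return [1 if label == category else 0 for label in labels]


def jaccard(a, b):
    """|A intersect B| / |A union B| for two equal-length binary vectors
    (assumes the union is not empty)."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return intersection / union


def mean_jaccard(truth, pred, categories):
    """Mean Jaccard index over the categories present in truth or prediction."""
    scores = []
    for c in categories:
        a, b = binary_mask(truth, c), binary_mask(pred, c)
        if any(a) or any(b):  # assumption: skip categories with empty union
            scores.append(jaccard(a, b))
    return sum(scores) / len(scores)


truth = [0, 0, 1, 1, 1, 0, 2, 2]
pred = [0, 1, 1, 1, 0, 0, 2, 2]
print(mean_jaccard(truth, pred, categories=[1, 2]))  # (0.5 + 1.0) / 2 = 0.75
```

In the example, gesture 1 is predicted one frame too early (two overlapping frames out of four in the union), while gesture 2 is matched exactly.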

An overview of the results for our different architectures is shown in Table 1. The predictions 
of the single-frame baseline achieve a Jaccard index below 0.5. This is to be expected as no 
motion features are extracted. We observe a significant improvement with temporal feature 
pooling (a Jaccard index of 0.775 vs. 0.465). Furthermore, mean-pooling performs better than 
max-pooling. Adding temporal convolutions and three-dimensional max-pooling improves the 
Jaccard index to 0.842. 


The three last entries in Table 1 use recurrent networks. Surprisingly, the RNNs are only 
acting on high-level spatial features, yet are surpassing a CNN learning hierarchies of motion 
features (a Jaccard index of 0.842 vs. 0.888). The difference in performance for the two types 
of cells is very small and they can be considered equally capable for this type of problem where 
temporal dependencies are not too long-ranged. Finally, combining the temporal convolution 
architecture with an RNN using LSTM cells improves the score even more (0.906). This
deep network not only learns multi-level spatiotemporal features, but is capable of modeling
temporal dynamics within them.

Model                                      Crop   Depth   Pose   Jaccard Index
Chang (2014) (MRF, KNN, PCA, HoG)          yes    no      yes    0.827
Monnier et al. (2014) (AdaBoost, HoG)      yes    yes     yes    0.834
Neverova et al. (2014) (Multi-Scale DNN)   yes    yes     no     0.836
Neverova et al. (2014) (Multi-Scale DNN)   yes    yes     yes    0.870
Temp Conv + LSTM                           no     no      no     0.842
Temp Conv + LSTM                           yes    no      no     0.876
Temp Conv + LSTM                           yes    yes     no     0.906

Table 2: Montalbano gesture recognition dataset results compared to previous work. Crop:
the cropping of specific areas in the video using the skeletal information. Depth: the usage of
depth-maps. Pose: the usage of the skeletal stream as features. Note that even when we do
not use depth images, we still achieve better results.

In Table 2, we compare our results with previous work. Our best model outperforms the 
method of Neverova et al. (2014) when we only consider RGB-D pixels as input features (0.906 
vs. 0.836). When we remove depth information and perform no preprocessing other than 
rescaling the images, we still achieve better results (0.842). Moreover, we even achieve better 
results without the need for depth images or pose information (0.876 vs. 0.870). 

To illustrate the differences in output predictions of the different architectures, we show 
them for a randomly selected sequence in Figure 2. We see that the single-frame CNN has 
trouble classifying the gestures, while the temporal pooling is significantly more accurate. 
However, the latter still has difficulties with boundaries. Adding temporal convolutions shows 
improved results, but the output contains more jagged predictions. This seems to disappear 
by introducing recurrence. The output of the bidirectional RNN matches the target labels 
strikingly well. 

In Figure 3, we show that adding temporal convolutions enables neural networks to capture 
motion information. When the user is standing still, the units of the feature map are inactive, 
while the feature map from the network without temporal convolutions has a lot of active 
units. When the user is moving, the feature map shows strong activations at the movement 
locations. This suggests that the model has learned to extract motion features. 


Figure 2: The output probabilities are shown for a sequence fragment in the test set. The
dashed line represents silences. The non-recurrent models make more mistakes, have
difficulties making hard decisions about where the gesture starts or ends and are unable to smooth
out predictions in time. Adding recurrence enables deep networks to learn the behavior of the
manual annotators with great accuracy.

5 Conclusion and Future Work

We showed in this paper that adding bidirectional recurrence and temporal convolutions
improves frame-wise gesture recognition in video significantly. We observed that RNNs
responding to high-level spatial features perform much better than single-frame and temporal
pooling architectures, without the need to take into account the temporal aspect in the lower
layers of the network. However, adding temporal convolutions in all layers of the architecture
has a notable impact on the performance, as they are able to learn hierarchies of motion
features, unlike RNNs. Standard cells and LSTM cells appear to be equally strong for this
problem. Furthermore, we observed that RNNs outperform non-recurrent networks and are
able to predict the beginning and ending frames of gestures with great accuracy, whereas other
models show uncertainty at these boundaries.

In the future, we would like to build upon this work for research in the domain of sign language 
recognition. This is even more challenging than gesture recognition. The vocabulary is larger, 
the differences in finger positions and hand movements are more subtle and signs are context 
dependent, as they are part of a language. Sign language is not related to written or spoken 
language, which complicates annotation and translation. Moreover, signers communicate 
simultaneously with facial, manual (both hands are separate communication channels) and 
body expressions. This means that sign language video cannot be translated the way speech 
recognition can transcribe audio to written sentences. 
Figure 3: Motion Features This figure illustrates the effect of integrating temporal
convolutions. The depicted spatial feature map is the most active 4-layer-deep feature map,
extracted from an architecture without temporal convolutions. The spatiotemporal feature
map is extracted from a model with temporal convolutions. The strong activations in the
spatiotemporal feature maps while moving indicate learned motion features.